ABSTRACT

This document summarizes the activities conducted 
under the National Institute of Education (NIE) planning grant for
the new Center on Student Testing, Evaluation, and Standards and the 
conceptualizations that emerged from these activities. Chapter 1 
presents a summary of the planning activities actually conducted 
under the award, including particular problems and successes and a 
list of participants and their affiliations. Chapter 2 provides a 
technical report on the Research and Development mission for a Center 
on Student Testing, Evaluation and Standards, including a brief 
review of the literature, analysis of problems in practice, guiding 
themes for the research agenda and effective strategies for 
conducting the research. Chapter 3 is a futures paper which 
summarizes in nontechnical language the proposed Center's mission, 
long-range plans, and objectives. A four-page list of references
concludes the document. (Author/JAZ)
CHAPTER ONE: Summary of Planning Activities



Actual activities conducted under the planning grant varied from those
initially planned because of the delay in the competition and the additional
guidelines provided by the NIE for the mission area. As a result of these
NIE-instituted changes, the mission proposed in the planning grant
proposal had to be amended. Instead of devoting the planning process to
soliciting review of the mission proposed in the planning grant,
substantial time and effort had to be allocated to the mission amendment.
While the amendment abbreviated the wide and repeated review process
initially intended, the planning process nonetheless gathered wide input
from a number of stakeholders with interests in educational testing and
evaluation. In the sections which follow, we provide a chronology of the
planning process, describing particular problems and successes as they
occurred, and a listing of the participants in the process. Please note
that these planning activities benefitted from University-contributed
resources as well as from funds granted by the NIE.

Chronology of Planning Activities 

Planning comprised five major activities. These included
collaboration with members of the National Faculty to get their feedback on 
priority areas for the research mission and effective strategies for the 
conduct of R&D; mission and research planning by members of the Research 
Council and participating faculty and staff; review of mission and research 
plans by noted researchers and practitioners followed by revisions as 
necessary; and planning for collaboration with other laboratories, centers,
and state and local practitioners and policymakers. Specific activities 
within each of these areas are described below. 

Collaboration with members of the National Faculty to get feedback on
priority areas for research. In the planning proposal, we advanced the
idea of a National Faculty of interested practitioners and policymakers who
would collaborate with us during all stages of the research process, from
planning through dissemination; and in fact conversations with some of
these individuals influenced the perspectives described in the planning
proposal. Once the planning award was granted, an initial task was to get
systematic and specific feedback on the mission and research we proposed
from a National Faculty composed of members representing the full range of
interests, including teachers, administrators, school board members, state
and local superintendents, state and local directors of research and
evaluation, military trainers, and test publishers. A meeting of
representatives from each of these groups was convened at the annual meeting
of the American Educational Research Association in April 1985 as a first
step in this feedback process. The proposed mission was discussed, as were
ideas and specific projects. National Faculty members were given copies of
the proposed mission and questionnaires for soliciting their feedback on the
following issues:

- Overall importance of the problems addressed by the mission;

- Additional problems that ought to be addressed;

- Probable effectiveness of strategies for conducting RDD&E;

- Additional strategies that should be included;

- The single most important objective that should be addressed by a
national center on student testing, evaluation, and standards;

- The importance of the issues emphasized in each of the proposed
research programs, and the need for additions, deletions and/or
modifications;

- Reactions to each of the potential research initiatives in terms of
its relevance to the mission; importance of the problem addressed;
potential impact on policy/practice; likelihood of success.

In addition to providing their own reactions, members also were asked 
to solicit the reactions of their peers and to report to us accordingly. 
They were asked to provide such feedback by May 31. 

This feedback was summarized quantitatively and qualitatively for the 
June deliberations of the Research Council (see below). Reactions from the 
National Faculty were uniformly positive with regard to the importance of
the mission and the probable effectiveness of proposed RDD&E strategies.
They highly rated the proposed research programs and were generally
favorable toward all of the research initiatives. Respondents appeared
most positive about R&D related to the use of testing for instructional
improvement and about gaining knowledge about how to deal with practical
problems. 

Members of the National Faculty generally were enthusiastic about the 
opportunity to offer their views and to collaborate in research planning.
In some cases, particularly at the highest administrative levels,
enthusiasm exceeded the time available to review proposal documents
carefully and to respond in depth to them within time constraints. In these
cases, feedback was more informal, through personal interaction and
conversation. Other sources of feedback included informal meetings at
conferences attended during the planning period. For example, the ECS
conference at Boulder provided an opportunity to meet with many state-level
decisionmakers, and a College Board-sponsored equity conference enabled
meetings with local and state administrators and subject matter experts in
a range of disciplines.

Mission and research project planning by the Research Council. As
proposed in the planning proposal, a Research Council composed of Center
directors, program directors, and representatives of each collaborating
institution (University of Illinois, University of Colorado, National
Opinion Research Center at the University of Chicago, and Educational
Testing Service) was to be central in clarifying the Center's mission and in
making decisions about what projects to fund during the first proposal
period. A meeting of this group was convened at UCLA on March 14-15, 1985
to reach consensus on directions for the mission and program organization
and to hear presentations on high priority research initiatives that might
be funded; management structure for the new NIE Center and schedules for
completing and reviewing drafts of the proposal were also discussed.
The Research Council was scheduled to meet with the National Faculty
during the annual meeting of the American Educational Research Association 
and subsequently to make final decisions about the mission and research 
projects. While members of the Research Council did meet with the National 
Faculty about general mission directions, decisions about specific projects 
to be funded and their planning were postponed until after Secretary 
Bennett revised the mission area and published suggested modifications. 

Subsequent to Secretary Bennett's announced modifications, Research 
Council members as well as interested faculty and staff were requested to
submit additional ideas for research projects, including a description of
the problem to be addressed, its significance in relation to mission,
proposed methodology, and budget requirements. These proposals were 
presented to the Research Council at a meeting held on June 4-5, 1985 at 
UCLA. 



At this meeting, the Research Council considered the reactions of the 
National Faculty to the proposed mission and the modifications suggested by 
Secretary Bennett. Based on their deliberations, they reached consensus on
the revised mission which was to guide the research programs, including
major themes and Center objectives. After agreeing on the mission, the
Research Council considered each proposed research project in relation to
the Center mission and objectives, its potential contribution to the
improvement of practice and to the development of theory and understanding
of fundamental issues, its intellectual rigor, and its possible
interrelationships with other proposed projects. Discussion focused on
project options and modifications which would increase the coherence of the
proposed projects within and across programs and/or which might be most
cost effective in producing a balanced overall program of research. Based
on these discussions, the directors of the proposed Center made
recommendations for the initial slate of projects to be funded and the
resources to be allocated to each. The Research Council concurred with the
Directors' recommendations. After reaching agreement on projects, team
meetings by program were held to refine key themes and objectives for each
program, to specify program study teams for future projects, and to discuss
interrelationships between projects and ways to facilitate aggregation of
findings. Responsibilities and schedules for producing the proposal were
then reiterated. Draft sections of the proposal were to be completed by
June 28 and subsequently reviewed both by members of the Research Council 
and by external reviewers and then revised as necessary. 

Review of proposed mission and research programs. As drafts of the
mission and strategy, operational plans for research and institutional
functions, and institutional capacity sections were completed, they were
reviewed first by the Center directors and members of the Research
Council and modified as necessary to improve the coherence of the proposed
work and its methodological rigor. After the initial review and revision
process, drafts of the entire proposal package were reviewed thoroughly by
both educational practitioners (Dr. Steve Frankel from Montgomery County
Schools and Dr. Lynn Winters from Palos Verdes Unified Schools) and by
noted researchers in the field of testing and evaluation (Dr. C. Robert
Pace and Dr. Samuel Messick). These individuals were asked in particular
to comment on the coherence of the mission and on the integration,
significance, and methodological strength of proposed research programs and
to offer suggestions for improvement. Subsequent to these reviews, the
proposal document was revised and the final document produced.

Planning for collaboration. Concurrent with the activities described
above, contacts were made with competitors for research centers which were
likely to have overlapping interests with a center on student testing,
evaluation and standards. These included the centers on writing, learning,
effective elementary schools, effective secondary schools, state and local
policy, and post-secondary teaching and learning. Principal investigators
and other key personnel were contacted at Chicago, Johns Hopkins, Stanford,
Pittsburgh, Teachers College, Michigan State, Wisconsin, Florida State,
Rutgers, Hartford, and Berkeley to discuss potential areas of mutual
concern and to agree, if successful in the competition, to future meetings
devoted to planning collaborative ventures. Two ideas for collaboration
which evoked considerable interest were participation in study groups aimed
at important problems in educational policy and/or practice (e.g., quality
indicators for the precollegiate and post-secondary levels); and sponsoring
joint conferences exploring methodological issues in conducting research
and evaluation in a specific topic area (e.g., effective schools).

Participants in the Planning Process

The activities described above involved individuals from the
researcher, practitioner, and policymaking communities. These individuals
included:

Marvin Alkin, University of California, Los Angeles
Gordon Ambach, Commissioner of Education, New York
Ernest Anastasio, EDUCOM
Josie Bain, Los Angeles Unified School District
Eva Baker, University of California, Los Angeles
Adrianne Bank, University of California, Los Angeles
Darrell Bock, University of Chicago
James Burry, University of California, Los Angeles
Leigh Burstein, University of California, Los Angeles
Beverly Cabello, University of California, Los Angeles
Dale Carlson, California State Dept. of Education
James Catterall, University of California, Los Angeles
William Cody, Supt. of Schools, Montgomery County
David Cohen, University of California, Los Angeles
Elaine Craig, University of California, Los Angeles
Phil Curtis, University of California, Los Angeles
Don Dorr-Bremme, University of California, Los Angeles
Walter Feurzeig, Bolt, Beranek & Newman, Inc.
Steve Frankel, Montgomery County Public Schools
Calvin Frazier, Commissioner of Education, Colorado
Howard Freeman, University of California, Los Angeles
Gene Glass, University of Colorado
Wayne Gordon, University of California, Los Angeles
William Harris, Educational Testing Service
Joan Herman, University of California, Los Angeles
Ernest House, University of Illinois
Pete Idstein, Christina Unified School District, Delaware
Jim Johnson, University of California, Los Angeles
Mary Johnson, Dept. of Defense Dependent Schools
Tom Kerins, Illinois State Board of Education
Ward Keesling, Advanced Technology, Inc.
Harold Levine, University of California, Los Angeles
Robert Linn, University of Illinois
David McArthur, University of California, Los Angeles
Bernard McKenna, National Education Association (NEA)
Joyce McLarty, Tennessee State Department of Education
James Mecklenburger, National School Boards Association
Samuel Messick, Educational Testing Service
Jason Millman, Cornell University
Bengt Muthen, University of California, Los Angeles
James Olsen, WICAT
Robert Pace, University of California, Los Angeles
Sharon Robinson, National Education Association
Edward Roeber, State Department of Education
Gila Saks, University of California, Los Angeles
Francisco D. Sanchez Jr., Supt. of Schools - Albuquerque (retired)
Tom Satterfield, Deputy State Supt., Mississippi Dept. Ed.
Geoffrey Saxe, University of California, Los Angeles
Richard Shavelson, University of California, Los Angeles
Lorreta Shepard, University of Colorado
Kenneth Sirotnik, University of California, Los Angeles
Marshall Smith, University of Wisconsin
Mary Lee Smith, University of Colorado
Harris Sokoloff, University of Pennsylvania
Elliot Soloway, Yale University
Robert Stake, University of Illinois
Floraline Stevens, Los Angeles Unified School District
Ron Tarr, U.S. Army Infantry School
James Ward, American Federation of Teachers
William Ward, Educational Testing Service
Noreen Webb, University of California, Los Angeles
Richard Williams, University of California, Los Angeles
Lynn Winters, Palos Verdes School District
Merlin Wittrock, University of California, Los Angeles
CHAPTER TWO: Technical Report on R&D Mission

This report outlines a mission for an R&D Center on Student Testing, 
Evaluation, and Standards. It begins with a brief review of the
literature, highlighting the authors' perceptions of current important
research directions for testing, evaluation, and standards. Next, a synopsis
of current problems is presented, followed by a conceptual framework for
conducting R&D on these problems. The report concludes with a summary of
the R&D objectives inherent in the framework.

The View From 1985 

During the last 15 years, testing and evaluation scholars and 
practitioners have learned a prodigious amount. They have redefined 
evaluation impact so that it is now much more than a simple technical
issue. They have proposed models, approaches, analyses, and solutions to
recurrent problems. During this period, too, evaluation and testing have
come to play much larger roles in public policy.

In testing, technical developments in item response theory (IRT)
(e.g., Bock & Aitkin, 1981; Lord, 1980) have provided a new and powerful
means of attacking previously intractable problems such as detecting biased
test items (e.g., Shepard et al., 1981), constructing and equitably scoring
computerized adaptive tests (Green, Bock, Humphreys, Linn, & Reckase,
1984), and creating and calibrating multipurpose item banks for the
effective assessment of individual students as well as instructional
programs (Bock, Mislevy, & Woodson, 1982). The conception of testing has
evolved from an unquestioned dependence on differentiation among students
to an emphasis on content encouraged by the criterion-referenced testing
movement that followed Glaser's (1963) landmark paper. Concurrent with the
renewed emphasis on content has been the forging of a promising linkage
between psychometrics and cognitive psychology (e.g., Brown & Burton, 1978;
Curtis & Glaser, 1983; Tatsuoka & Tatsuoka, 1982). Together, these
achievements represent the first step toward an integration of testing and
instruction (see, for example, the special issue of the Journal of
Educational Measurement, 1983).

In evaluation, simple linear models of evaluation, thought to mirror a
linear pattern of needs identification, planning, implementation, and
evaluation (see, e.g., Alkin, 1969; Stufflebeam et al., 1971), have been
replaced by analyses that recognize the complex interactions of technical,
social, structural, and political environments (e.g., Bank & Williams,
1984a, 1984b; Cronbach, 1982; Cronbach et al., 1980; Patton, 1978; House,
1977; Weiss, 1972). From simple, controlled studies of outcomes, design
and data collection have been augmented to include studies of how policy
goals, implementation and multifaceted information interact (e.g., Berman &
McLaughlin, 1977; Cook & Campbell, 1979; Stake, 1978). Studies of
evaluation have been enlarged to reflect a concern that the results be used
by a range of decisionmakers (e.g., Alkin, Daillak, & White, 1979; Bryk,
1983; Reisner et al., 1982).
The mission of evaluation now goes beyond the analysis and judgment of
particular programs (Cronbach et al., 1980). Its scope has expanded to
include consideration of how integrated evaluation systems work and how
they can be improved. To that end, evaluation information -- and the testing
programs that support it -- should help to clarify standards; and the whole
process should serve to target resources and to stimulate effort in areas
of critical need.

Our vantage point suggests that the models driving evaluation must be
formative (Scriven, 1967) and that they must attend to the shift in
emphasis to state and local initiative and responsibility. Educational
interventions are rarely treatments in the traditional research sense
(Burstein & Guiton, 1984); rather they are subject to a range of local
adaptations, surprise turns, and altered expectations. As a result,
formative evaluation requires a thorough understanding of the context in
which evaluation findings are developed and are expected to be implemented;
of the social, structural and political contexts in which education
resides; and of the pragmatics of life in the schools (Baker, 1981;
Sirotnik, Burstein and Thomas, 1983).

The efforts of the proposed CSTES must be guided by the following
questions: What test and other information creates the potential for
improvement? How should the quality of information be judged and
improved -- that is, how can the information be made more credible, valid,
and ultimately useful? (Cronbach, 1982; see also our expanded view in
Appendix 4.) The characteristics of useful information depend upon one's
perspective (Dorr-Bremme, 1983; Sirotnik et al., 1983). To be useful to
students and teachers, information should probably be very specific, should
be carefully timed, and should be presented in a way that takes into
account the limits of what can be productively absorbed. The way in which
information is conveyed and displayed is also important (Sirotnik &
Burstein, 1984). For instance, school and district managers may require
detailed analyses of educational services and policies rather than detailed
outcome information (Burstein, 1981); higher-level policymakers may demand
comparability of information; and the public at large probably prefers
credible generalizations without a lot of detail and backup evidence
(Smith, 1984).

The proposed CSTES must also be sensitive to possible conflicts
between information that will contribute to the top-down demand for
broad-level accountability (to improve management and to elevate standards
of excellence) and the bottom-up demand for adaptive, sensitive information
that will be useful at the local level (Baker, 1983). These two sets of
demands push in different and not totally compatible directions. Some of
the tensions are obvious. A testing and evaluation system whose purpose is
instructional improvement requires information which is based on local
expectations and resources, which is adaptive to unplanned changes, and
which is timed so that options can be assessed and selected. But external
requirements pull in the direction of comparable, more uniform designs for
information.
Both top-down and bottom-up points of view would be better served by
an expanded view of standards of educational quality. These data
requirements extend beyond student attainment of particular subject matters
or basic skills to information about student learning processes,
educational services, instructional processes, and important contextual
factors. Such a data base approach implies that information needs will be
driven selectively by the pragmatics of the environment.

What is the potential role of disciplined inquiry (Howe, 1984) in
addressing the competing expectations for information, and what insights
can it provide into the concerns for quality expressed in A Nation at Risk
(1983) and in other prestigious reports (Goodlad, 1983; Boyer, 1983; Sizer,
1984)? How can the new standards articulated by legislatures and by local
school boards be connected to a broadened view of educational quality?
What can science, research and development, and conceptual analysis
contribute to productive educational reform, and what are their
limitations? One important function of the CSTES is to answer questions
such as these.

This view from 1985 reflects our perceptions of the current important
research directions for testing, evaluation, and standards. Guiding these 
perceptions are several global beliefs: 

o Testing, evaluation, and standard setting can contribute to
improving the quality of education. Tests -- when they are well
conceived, constructed, administered, and analyzed -- can provide
valuable insights into how individuals and classes of students are
learning; they can help guide teaching, administration, and
policymaking within our educational institutions. Evaluations of
programs -- especially when they are seen as improvement oriented,
locally useful, and iterative -- can help to guide the reallocation
of resources, the modification and improvement of activities, and
the retraining of personnel. Standards -- set with due attention
both to what is desirable and to what is feasible at the state and
local levels -- can help to focus attention and promote
accountability for educational improvement.

o Testing and evaluation are important tools for promoting
educational equity. Tests, when they are sensitive to individual
differences and preferences in learning styles, provide a powerful
means for diagnosing students' unique needs and providing effective
instruction for all students. Furthermore, tests, when they match
classroom instruction, can provide fair and equitable measures of
student progress, measures which focus on learning accomplishments
rather than background characteristics. Achievement measures, as
well as measures of educational processes and community context,
can help to identify areas where the needs of particular groups are
being met and where more attention is needed, facilitating more
effective programs for all.

o Testing and evaluation should serve the needs of a multiplicity of
users. Teachers may need test and evaluation information to make
instructional decisions; and local school and district
administrators, as well as policymakers at the state and federal levels,
need such information to guide their planning and decisionmaking.
If they are to be useful in supporting and improving schools,
evaluation and testing activities should be decentralized to the
local level, while at the same time maintaining their utility for
addressing legitimate public policy concerns at state levels in
particular.

o Testing, evaluation, and standard setting are endeavors which are
partly technical, partly political, and partly social. Technical
expertise is essential in test development and analysis, to ensure
the valid and reliable use of test results; social understanding is
essential to ensure fairness and utility. Similarly, evaluation
questions arise out of people's information requirements, while the
design and interpretation of evaluations depend on technical
competence. The definition of standards depends on values and
consensus; the measurement of their attainment involves technical
considerations.

While we are optimistic about the potential of educational testing and 
evaluation, we also are aware of their current shortcomings, cognizant of 
their potential misuses, and sensitive to their possible unintended 
effects. A national center must play a vigilant role with regard to these 
concerns and function as a consumer advocate to the field, analyzing
current practices and informing public policy.

Problems in Practice

Research in educational testing and evaluation has made important
strides in the last decade and its methodologies hold great promise for
improving the state of education. Nonetheless, significant problems remain
in educational practice, problems related to the quality and diversity of
existing measures, to the validity of the inferences that can be derived
from these measures, and to their utility to and impact on
the educational system:

Problems related to quality of information.

1. Most of the testing and evaluation procedures currently used to
assess students, programs and schools cover only a narrow range of the
knowledge and skills that are the targets of schooling and do so without
adequate attention to the nature of this knowledge and these skills. For
example:

o The National Council of Teachers of English has long decried
reliance on multiple choice tests as measures of writing skills.
Associations of teachers of mathematics, of social studies, and of
science have similarly criticized the content of existing tests
and the levels of achievement which are assessed.

o In the push to implement new testing programs, some states and
school districts have paid more attention to new psychometric
techniques than to the knowledge domain being assessed and its
cognitive underpinnings.

2. Given what is known about testing and evaluation design, tests
tend to be of poor quality. For example:

o The testing materials most commonly used by teachers, e.g.,
end-of-chapter tests, are often extraordinarily poor. They can
mislead the teacher into believing that students have learned
when, in fact, they have not; or that remedial exercises are
needed when, in fact, more advanced materials would help to
enhance learning.

o The bells and whistles of the computer revolution and its slick
print-outs often give an undeserved aura of scientific rigor to
score reports. What the reports fail to convey is the
arbitrariness of many classifications (e.g., "mastered" vs.
"failed to master") and the poor reliability of the information,
which may be based on only two or three items per skill.

3. Bias in the assessment of achievement for special groups is a
continuing problem. For example:

o While concerns for bias have alleviated many problems of
stereotyping, teachers report that many formal tests are unfair
for their students.

o Sophisticated psychometric techniques have been developed to
identify biased items but the source of the identified bias
often remains unknown.

4. The quality of measures at the post-secondary level is
particularly problematic. For example:

o College admission measures serve as the primary indicator of the
entire precollegiate system, ignoring other important outcomes and
alternate postsecondary experiences. These measures, in addition,
are not well articulated with either precollegiate curriculum or
post-secondary course offerings.

Problems related to quality of inferences.

5. Most testing programs and evaluation systems devote scant
attention to the mediating factors, e.g., the quality of educational
processes, background variables, and other contextual characteristics,
which are basic to understanding student performance. For example:

o Every year, a metropolitan newspaper in California ranks schools
in terms of their students' scores on achievement tests. Missing
from these public reports is any consideration of the factors that
may explain differences or changes in rank, such as a sudden
influx of children from different language backgrounds, high
transiency rates, and absence rates.
6. The Federal concern for developing a National Report Card
underscores the need for state and national level indicators of overall
educational quality, but many problems remain. For example:

o The component indicators of quality receive considerable attention
but tend to focus on grossly and uncertainly defined but more easily
accessed datasets of macro variables, e.g., dropout, student
"achievement" data (like the SAT examination), and teacher academic
history. Neglected is the broad picture of input, process, and
outcome indicators which might provide the critical context for
understanding and judging comparative quality.

o Potential sources of valid student performance data exist in
ongoing state assessment programs, for instance, but
investigations of means for aggregating such information for
state-by-state comparisons are only just underway. The importance of
test content receives less attention.

7. Concern for student achievement and the quality of American 
education escalates each time an International comparison of student 
performance Is conducted. Yet there has been little consideration of the 
use of International studies, or the measures generated by them, as 
benchmarks to protect America's ability to compete In technological, 
academic, and economic futures. For example: 

o The Second International Mathematics Study provided a comparison of
the United States and 20 other countries. Results show that the
United States performed relatively poorly in comparison with
Japan. Less serious consideration was given to the meaning of
these data with respect to the role that content coverage, the
quality of instruction, or the differences in background,
abilities, and attitudes might play in the highlighted performance
differences, although data on these student and instructional
characteristics are available.

8. Because different types of decisions (e.g., policy, institutional,
instructional, counseling) require different types of information, a
patchwork system for collecting information has been created. Not only are
the testing and evaluation procedures used unnecessarily intrusive, but the
information produced is overly redundant. The redundancy may be
particularly acute for special populations. For example:

o Children participating in a Chapter I program at a midwestern
school must take the CTBS in the fall and again in the spring, in
addition to mandated state assessment tests, a districtwide
norm-referenced test, and an array of curriculum-embedded tests.
The information from these tests is never integrated, is largely
redundant, and only tangentially influences teaching practices.

Problems related to utility and impact.

9. Student testing programs, on which much of evaluation depends, are
externally imposed from the top down, but the use of data for local school
improvement is a bottom-up proposition, local and specific in nature. The
result is data of limited utility for teachers and school administrators.
For example:

o Extensive interviews with district administrators, principals, and
teachers in one midwestern school district found that while each
of these groups believed the tests had value for the system as a
whole, each group also said the tests were not germane to its own
needs. Thus, district administrators said that tests were helpful
to teachers; teachers thought them useful to principals; and
principals felt they were essential to district administrators.
In short, no group acknowledged that it found such information
valuable.

o According to a national study of teachers' use of testing,
teachers reported very little practical decisionmaking based on
formal testing because of the mismatch of test content and
instruction, poor reporting formats, and inappropriate timing of
results.

10. Schools are supposed to be vehicles of social mobility and equity,
giving all students an opportunity to achieve and to reap the benefits of
productive participation in society. Although rigorous testing systems are
supposed to contribute to this process, evidence suggests that testing may
actually impede social mobility. For example:

o According to a prestigious national study of schooling, testing
has contributed to the tracking of students into rigid vocational
and academic lines, thereby reducing the prospects for individual
growth and satisfaction.

o The treatment of special populations (e.g., children from
different language backgrounds or with different developmental
histories) often amounts to placement in dead-end tracks with
little opportunity for change or advancement.

11. Tests and evaluation are regarded not only as processes for
assessing educational quality, but as significant interventions in
themselves that will promote excellence and high standards. There is
widespread belief that the imposition of testing systems will focus and
motivate learning, but other effects contrary to excellence may also
accrue. For example:

o One eastern school district, echoing teachers' concerns in a
national study, reported substantial narrowing of the curriculum,
away from science, art, history and higher level skills and toward
the basic skill areas assessed on mandated tests.

o Acceptable pass rates are a political necessity, resulting in
cut-scores that reflect neither excellence nor even minimum
competency.



These three problem clusters (quality of information, quality of
inferences and interpretation, and utility and impact of testing and
evaluation reforms) are central to the conceptual framework underlying the
proposed research program. This conceptual framework is described next.

Assessing and Improving Educational Quality:
Conceptual Framework for the CSTES

We take as a point of departure a model of the Educational Quality
Improvement Process (EQIP). This EQIP model portrays the role of testing
and evaluation in improving educational quality. The model is grounded in
our understanding of the nature of the educational context, which we
explicate next. Two critical requirements for the model are then
described, validity of information and quality of inferences; the effect
of these requirements is ultimately judged by their utility and impact on
educational quality. These requirements and their impact are the focus of
a substantial portion of our R&D program.

Our goal is to conduct R&D that contributes both to better
understanding of educational quality and to its development as well. Our
simplified picture of the role of testing and evaluation in improving
educational quality is presented in Figure 1.

The Educational Quality Improvement Model (EQIP)

The model displays the interaction among the formulation and
implementation of educational policies and practices and the assessment and
judgment of their quality. At the simplest level, educational policies are
formulated to influence educational practices. But much of educational
practice also develops bottom-up and on an informal basis. These existing
policies and practices create the actual level of educational quality
experienced by students and teachers. The next step is an assessment of
educational quality, a process that can address only partially the true
quality of effort and its effects. Following assessment, judgments are
reached about how well policies and practices are working. These judgments
may be strongly influenced by explicit standards but also develop from a
wide range of other values. The model is arrayed in a circle to indicate
that this process is neither discrete nor linear, and its components are
set in important contexts which significantly affect and are affected by
their operation. We have described one point of entry in the model,
starting with the formulation of educational policy. Taken at another
entry point, judgments of quality (substantiated or unsubstantiated), or
attention to explicit standards, lead to assessment or assessment policies
and practices which in turn affect other educational policies and
practices. Here assessment is acting as an intervention. From a third
entry point, assessment of quality can identify needs for new
interventions in policy and practice, which are subsequently assessed,
judged, and become the subject of continued or modified action.

Throughout the model, there is recognition of both implicit and
explicit meanings and realities and of formal and informal sources of
information. (Lindblom and Cohen (1983) have been informative on this
point.) For example, the model recognizes that formal policies provide
only general guidelines and exert imperfect control over actual practices
at the various levels of the educational hierarchy. Second, policies and
practices are dependent on formal and informal assessments which provide a
narrow and imperfect estimate of reality. Third, the model recognizes that
judgments about quality require the integration of various sources of
information against general values and expectations for education, only
some of which are represented in explicit standards. Fourth, the model
acknowledges, with the intent to explore, the effect of contextual factors
on the assessment and judgment of quality. These factors include changing
policy expectations, social, organizational, political, and demographic
factors and resources which are in constant flux and which can only be
grossly approximated for any period of time.

Educational Contexts of the EQIP Model 

The EQIP model is grounded in our understanding of how the educational
system operates. Below we present three views which are essential to our
understanding. The first is a hierarchical view of the multiple policy and
administrative levels responsible for the educational system. The second
is a longitudinal view of the educational system and its interdependent
segments. The third is a pluralistic view of the system's clients, its
students.

Hierarchical view of the educational system. Figure 2 depicts a
hierarchical view of the multiple policy and administrative levels which
are responsible for the quality of educational policies and practices.
While the picture portrays the system as a neat configuration of nested
entities, the concentricity of the circles is neither neat nor closed. A
hallmark of the American educational system, and one which complicates both
its evaluation and governance, is that the system is "loosely coupled"
(Weick, 1976), with each of the lower levels exerting significant
independence.




The figure shows the student at the center, as the primary client and
ultimate recipient of educational quality, surrounded by the various
contexts which influence the quality of education: classroom, school,
local district, and state educational institutions as well as national,
international, and socio-political contexts. At the post secondary level,
this picture would omit "local district," except for certain community
college venues. For private institutions, the state level may or may not
have relevance. The intent of this picture is to illustrate that policies
at various levels, translated into actual educational practices, have
successive impact, with direction of impact both outward and inbound (that
is, "bottom-up" and "top-down"). These policies may have direct impact on
students in the case where they completely traverse the entire system
(e.g., lengthening the school day). Or the policies may affect students
less directly and depend on a chain of assumptions about the relationships
between certain factors and educational quality (e.g., raising teacher
salaries).

The point is that policies and practices at all levels, and the
interactions among them, affect the ultimate quality of education
experienced by students. To improve policies and practices, as well as to
promote accountability, we believe that all practitioners and policymakers
need information about educational quality (i.e., information about the
quality and consequences of students' classroom experience). Overlapping
assessment systems have mushroomed in an attempt to provide such
information for each level (e.g., routine classroom assessment, district
evaluation programs, state assessment), yet these assessment systems, like
their corresponding organizational structures, are not necessarily
congruent in focus.

Longitudinal view of the educational system. While the hierarchical
view describes the multiple administrative levels involved in the system,
the longitudinal view is concerned with multiple institutional levels. The
longitudinal picture is essential to examine the quality of the system as a
whole and to assess its effectiveness in educating and preparing the
populace for productive lives. Figure 3 presents this longitudinal view of
educational services and outputs, displaying the path a student takes from
school entry through critical transition points to various exit points:
entry to elementary school, the end of sixth grade, the end of junior high
school, the end of high school, and various post secondary options
(sometimes commencing before formal graduation), including traditional
college and university enrollment, technical training, employment, the
military, and non-productive outcomes (unemployment, incarceration,
etc.).

FIGURE 3: LONGITUDINAL VIEW OF EDUCATIONAL SYSTEM



What is the relationship of quality assessment to this longitudinal
view? First, there are both short-term and longer term effects. In the
short term, the success of students at any point in the continuum can be
used to estimate the cumulative effects of earlier educational services,
conditioned, of course, by contextual variables. Taking the longer term
view, the figure reminds us that there are various legitimate outcomes of
education and that choices other than college for students should be
included within the educational quality assessment paradigm, e.g., success
with business and corporate training requirements, the entry and retention
of individuals in employment, and their entry and success in the military.
(SAT scores and other measures of college preparation, in other words, give
at best an incomplete picture of educational quality.)

A second implication of the longitudinal view is the obvious
interrelationships between educational levels and the need for
articulation at the identified junctures. Clark writes eloquently of the
mutual effects of precollegiate and post-secondary systems, effects that
often are mediated by tests and other assessment and placement devices; his
words apply to junctures between other levels as well:

We may conceive of the relation between secondary and higher
education at the outset as a two-way street along which the
nature of traffic in one direction is quite different from
the flow of people and activities in the other. Up the
street, from the "school" to the "university," we encounter
primarily a flow of students. The school selects them,
trains them, orients them, certifies their competence, and
sends them on... Whatever the quantity and the quality, and
the degree of opportunity, the school clearly shapes the
human resources made available... In education, generally,
an impelling principle of sequence gives lower units this
particular role in determining the nature of higher levels.
Down the street, from the university to the school, the
traffic is different, consisting always of two major
vehicles of influence. One is personnel... A second vehicle
is curricular in nature: the university sets course
requirements for its own students, and often itself sets
entry requirements that influence what teachers will teach
and what students will study in the school. Students who
want to go on must master those materials and pass those
examinations that permit them to be a part of the upward
flow.

Burton R. Clark (1985)

Just as the hierarchical view of educational systems highlights the
need to be sensitive to the needs of various levels of the system
hierarchy, the longitudinal view encourages sensitivity to various levels
of schooling. For example, the modal organization of elementary education
is the self-contained classroom, resulting in the need for multidimensional
indicators aggregated within classrooms (e.g., performance in different
content areas, self-concept, attendance). The departmental organization at
the secondary level presents an opportunity for more content detail, poses
the challenge of aggregating students across teachers, and blurs lines of
accountability for basic skills. Explicit course choices and differing
educational goals contribute additional complexities to indices of quality
at this level, as do issues of different classroom organizations and
problems of articulation among grade levels.

The pluralistic view of the educational system. The dramatic
diversity of students served by our schools provides a third important
context for our conceptual framework. Students come from a variety of
backgrounds, ethnicities, and communities. They exhibit different ability
levels, cognitive approaches, language facility, and interests. Many
students have special needs for educational attention: physical handicaps,
learning disabilities, and/or highly developed talents or
aptitudes. Students aspire to the full range of accomplishments a formal
educational system can provide. And the American system is committed to
helping them to satisfy such goals. It has assumed dual responsibilities
for addressing individual differences while communicating the common
learnings necessary for an integrated and unified society.

Recognition of the energy and richness of our student population
brings a serious set of concerns to the picture of educational quality. We
confront choices related to how much attention should be paid respectively
to maintaining diversity or to increasing commonalities in our educational
policies and practices. We must also take into account student differences
as we attempt to assess the quality of the policy choices we make. Here
our concern relates to the fairness issue. We balance equity interests
with attention to standards. Setting different standards for minority
students, for instance, increases diversity but reduces fairness in the
long run. Technically, we have advanced in methods for detecting bias in
measurement, and for correcting statistical differences attributed to
varying levels of student performance, but much remains to be done in the
area of what and how we assess our diverse student body and how we make
confident inferences from our findings.

American educational variety also grows from the diversity of the
communities in which our students reside. They may live in urban,
suburban, or rural settings. Their communities may be stable or radically
changing. The economic productivity surrounding them may be vigorous or
tenuous. Their schools may be large complex organizations or smaller, more
personalized settings. They may be very like other students in their class
in background or represent a minority of one or another type. Attempts to
improve educational quality and to assess its quality will succeed only to
the extent that these important factors are considered in our analyses and
our actions.

Our EQIP model and the three contexts above present a backdrop for our
approach to assessing educational quality. For the model to be
successfully used, two critical requirements must be met. One requirement
is valid information; the other is quality inferences derived from that
information. Clearly, these requirements are necessary but not sufficient:
social, organizational, political, and simple human preferences influence
our policy choices and our interpretations of their success. But valid
information and high quality inferences are at the core of the EQIP model.
They are amenable to conceptual study and empirical improvement and are
appropriate to the fundamental issues to which the CSTES is directed.

Requirement One: Validity of Information 

Validity of information includes concerns with the accuracy,
representativeness, and comprehensiveness of what are claimed to be
measures of educational services and effects. While it is tempting to cast
this argument broadly in terms of the R&D needs of the full range of
information that might be included in the assessment of education, we plan,
at the outset, to focus substantial effort on the issue of student
achievement and performance. The reason for this decision is not to
disdain the utility of other indicators of performance; to some extent we
will address these as well. Rather, we believe that this R&D Center
should help others to estimate teacher quality, school effects at the
elementary, secondary, and post secondary levels, impact of state and local
policy, and content-based work in the disciplines rather than devoting the
majority of our resources to areas to be addressed by other institutions.
Nowhere, however, is the responsibility for exploring fundamental issues in
student performance more properly assigned than to the Center for Student
Testing, Evaluation, and Standards.

We believe that much of the present information on student achievement
is based upon anachronistic models of human learning and often does
not reflect the best available psychometric and statistical models.
Improving the quality of student achievement information depends on
relatively sophisticated notions of validity. Four points deserve particular
attention: content quality; appropriate approaches to content assessment;
cognitive basis of individual differences; and assessment purpose.

Content quality. Research on student assessment must take into
account the content of what is being tested. General levels of content
specification must be augmented to reflect research on cognitive knowledge
representation as well as the more significant concepts in the field, as
judged by scholars in the academic disciplines (Haertel and Calfee, 1983).
Quality content must be at the core of any measures. With respect to
subject matter content, recent advances in cognitive science are pertinent
(Larkin et al., 1980). They suggest that models for content-based analyses
must be specific to subject matter structure for the design of procedures
to find out what students really know and can do (Shavelson, 1983). These
procedures may be much more susceptible to differences in the way content
is organized within the discipline. If so, then much less general rules of
thumb for achievement test design will be required, and one
challenge will be exploring the limits of general test development
procedures versus the need to create separate models useful for assessing
different content areas.

Appropriate approaches to content assessment. Validity in measurement
also depends upon the belief that the means available for assessment are
appropriate to the subject matter. A case in point was the dependence upon
multiple choice measures to assess students' ability to compose essays.
Though logically indefensible, this practice persisted because of the ease
of computing reliability estimates, the low cost of data collection and
scoring, and the reliance on correlations to show that ability in
composition correlated with these measures at some respectable level.
Research studies of writing assessment (Hays et al., 1980), however,
demonstrated that the cognitive demands of written composition were vastly
different from those of selecting responses. The development of practical,
efficient, and reliable scoring strategies combined with these cognitive
analyses to permit the more valid assessment of this critical skill area
(Quellmalz, 1985). Similarly, it may be demonstrated that certain problem
solving tasks in science or analytical tasks in comprehension of literature
may be better assessed by means other than traditional multiple choice or
short answer tests. Determining what options there are and marrying those
findings with what scholars feel are sensible approaches to the assessment
of their content areas could result in a broader, differentiated mode of
student achievement measurement.

Cognitive bases of individual differences. A third issue in valid
achievement assessment is the extent to which any approach permits students
to demonstrate the most, or the best, that they know. Returning the
attention of researchers to the individual differences among students'
test-taking preferences may allow us to assess more accurately what
educational effects are. Attention to alternative symbolic representations
of task (aural, pictorial and dynamic) and to options in response modes, as
developed from cognitive and subject matter analyses, may allow the
creation of more diverse testing systems. These options can help to
overcome the criticism of uniformity, triviality, and narrowness of current
testing practices and reflect more directly the reality of the enormous
variation in cultural, experiential, and learning histories of our
students. What we need to explore are alternative options for teachers and
students to demonstrate educational achievement, options that are not
easier or harder, or preferred rather than undesirable, but assessment
choices that share rigor and credibility. If this exploration is
successful, our contribution to the validity of information will be clear,
and ideas about what "difficulty" of tests means might undergo
redefinition.

Assessment purpose. A fourth area of validity in achievement measures
relates to the purpose for which the measure is used. While achievement
test purposes are commonly thought of as diagnostic, placement, monitoring
and certification, with different models of testing proposed for each, our
particular interest focuses on validity as it relates to student learning
in the instructional context. An important issue is the extent to which a
single assessment system is valid for a variety of purposes. The types of
tests which teachers most frequently use and accept (Herman & Dorr-Bremme,
1983; Dorr-Bremme, 1983) deserve particular attention. Such systems need
to combine attention to design, psychometrics, and new technologies.
Research will explore ways to increase both the validity and utility of
such systems.

Requirement Two: Quality of Inferences 

The central thrust of efforts to improve the quality of inferences
from educational information is to build our confidence that the bases for
judgment, evaluation, subsequent action, and consequent impact on the
educational system are as accurate and circumspect as we can make them
within existing knowledge and resource constraints. The issues here are
legion. First, we have concern with the proper linkage of information to
any given primary decision context. Second, we are concerned for the
multi-level, multi-institutional use of information. What distortions
occur when information collected for one purpose is applied at another
level? Third, we are interested in economy, to avoid burdening the system
with more and more information of less and less utility. Methodological
options for creating linked data bases may provide a solution. Fourth, we
are interested in the comparison issue. Given a set of information, how do
we know what to make of it? Last, we maintain an interest in expanding, as
appropriate, the information base to ground and elaborate our
interpretations.

Multi-level inference. Let us treat the first three questions
together, as they pertain specifically to the multi-level,
multi-institutional problem, remembering that inferencing is at the core of
evaluation processes. A primary problem is that evaluation practices at
each level of the educational hierarchy operate relatively independently
from one another. State level information is specially designed,
collected, and analyzed for state purposes; districts use a different set
of measures for their decision needs; to the extent that teachers use
formal information sources for their instructional decisions, they tend to
rely on those provided with curriculum materials or developed on an ad hoc
basis. On the surface it seems reasonable to assume that different
measures are needed to meet the unique decision context at each level, and
it might be argued that an overlapping testing strategy permits
triangulation that supports validity. However, there are serious problems
within such an approach, with tensions between the need for more
generalized measures as one moves up the hierarchy and the need for
sensitivity to what actually transpires at the lower levels. For example,
a primary function of achievement testing at the state level frequently is
to ascertain what students are learning with regard to a state curriculum
framework. The framework is typically specified at a general level, as are
the measures which assess it. The resulting assessment may or may not be
sensitive to either the variations in specific curriculum implemented at
the district level, or to variations in instructional programs implemented
by teachers in each district. Or more to the point, we can be sure that at
least some mismatches will occur at each level, mismatches that
compromise the validity of the assessment for some purposes. The
assessment will always miss its mark to some degree and add both noise and
valid information to the system.

Other problems arise when there is no common basis for inferences
about educational policies and practices at the various levels. The
general intent of educational policy formation is to improve the quality of
educational services and to help our students attain the highest levels of
competency in school subjects. At some time, the policies need to be
translated into practices that are compatible with understandings at the
levels of real implementation, ultimately with what teachers and students
see as their requirements and day-to-day practices. There is high
potential for slippage when the information used to assess quality and
formulate policy functions independently from that used to actually teach
children. While it would be neither appropriate nor profitable to envision
a fully articulated system where information useful for instructional
decisionmaking is also employed at the highest policy levels, the present
low level of overlap creates special and persisting anomalies. It also
causes unnecessary costs, both in financial resources devoted to test
administration and scoring and in opportunity costs related to teachers'
and students' instructional time. Current duplicative systems, in other
words, may be both ineffective and uneconomical.

Part of the effort of this Center will be to explore the limits of
common or compatible information bases for multilevel educational
decision-making, particularly in the area of student achievement. In the
name of economy, of preservation of student time, and of quality
inferences, we believe attempting to move the various levels toward some
larger proportion of shared data has merit. Embedded in this problem are
methodological issues related to integrating horizontally different kinds
of information appropriate within a level, to linking and equating locally
sensitive measures, and to summarizing and integrating information
vertically (or decomposing down) so that the appropriate level of detail is
available for information users (Baker, 1983; Burstein, 1980, 1981, 1983).
All these methodological issues again are nested in educational contexts
that must be taken into account substantively and methodologically to
reflect the special character of different levels, and the fact of
individual differences: among children, teachers, schools, communities,
districts, and states.
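
As an illustration only, the idea of summarizing a shared body of
information vertically, so that each level sees its own degree of detail,
can be sketched in a few lines of Python. The records, the level names,
and the function summarize are hypothetical and are not drawn from the
work cited above. The sketch assumes a single common score per student
and simple unweighted means, and so sets aside the linking and equating
problems just described.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical student-level records:
# (state, district, school, classroom, score). All values are invented.
records = [
    ("ST", "D1", "S1", "C1", 62.0),
    ("ST", "D1", "S1", "C1", 70.0),
    ("ST", "D1", "S1", "C2", 81.0),
    ("ST", "D1", "S2", "C3", 55.0),
    ("ST", "D2", "S3", "C4", 90.0),
    ("ST", "D2", "S3", "C4", 84.0),
]

# Levels of the hierarchy, ordered from most to least aggregated.
LEVELS = ("state", "district", "school", "classroom")


def summarize(records, level):
    """Return {unit path: {"n": count, "mean": mean score}} at one level.

    The same student records feed every level; only the grouping key
    (how much of the path is kept) changes.
    """
    depth = LEVELS.index(level) + 1
    groups = defaultdict(list)
    for *path, score in records:
        groups[tuple(path[:depth])].append(score)
    return {key: {"n": len(v), "mean": mean(v)} for key, v in groups.items()}


if __name__ == "__main__":
    for level in LEVELS:
        print(level, summarize(records, level))
```

A teacher would read the classroom summary and a state agency the state
summary, yet both rest on the same records. Whether real measures can be
shared this way across levels is precisely the open question raised above.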

We are also interested in the extent to which common or linked quality
assessment can inform us about the cumulative effects of education across
the longitudinal view of the system, as presented in Figure 3. Here the
concern is to include indicators that are sensitive longitudinally to
educational quality as exemplified at different institutional levels, e.g.,
elementary, secondary schools. At present our information is woefully
limited. Can we tell if student effort is qualitatively maintained,
increased or decreased at identified institutional points? Can we estimate
cumulative effects? Can we assess the articulation of programs across
school levels? In general, the answer to these questions is a resounding
no. Our interest, then, is to develop measures that have clarity and
continuity. And we need ways to link information between grades within
particular institutions and across institutions to provide ecologically
valid inferences about student progress over time.

The potential benefit of a multi-level, multi-institutional approach 
to interpretation is clear. Not only could the intrusiveness of testing 
and measurement be reduced, but the validity and linkage of inferences 
could be enormously strengthened when policymakers and teachers share a 
common core of information (if not at the same level of detail) to guide 
both their policy formation and educational practices. 

Comparative inference. As noted earlier, valid inferencing raises 
questions of comparison. Despite wishes and dreams to the contrary, 
comparison is an important fact of life in educational evaluation and 
policy assessment. Although the habit of comparing students on percentiles 
has waned as the favored metric of educational quality, there is strong and 
abiding concern with the comparative quality of educational services, 
organized in schools, districts, and states. Comparison is at issue in 
determining the merits of regular, on-going educational enterprises, but is 
more readily understood in the context of judging the effects of an 
intervention. 

A first approach for judging the cumulative impact of an intervention 
is to examine its effects over time on existing indicators regularly used 
to track educational practice. These may be as homely as regularly 
administered standardized tests with all their known technical limitations 
but undeniable public credibility. Or a broader range of indicators could 
include dropout rates, attendance, and performance on tests sponsored by 
administrative levels beyond the school (such as districtwide competency 
measures or state assessment). Using trends over time for comparison is a 
complex matter because of changes in measures, trend interpretation, cohort 
differences, the operational meaning given to measures in different 
localities, and so on. But looking over time to determine whether student 
performance and regularly tracked processes improve on a range of 
indicators is an obvious and important first step in assessing progress. 

A second kind of comparison fits within more traditional concepts of 
"external criteria," where broad effects are gauged inferentially by 
analyzing indicators remote from the school sites where education takes 
place. A principal example is the use of postsecondary indicators to judge 
the quality of precollegiate education. Witness the enormous attention 
paid to the decline in SAT scores, which is interpreted as evidence of a 
decline in the overall quality of schools. This approach has a number of 
problems. Even putting aside contention about the meaning of such blended 
aptitude and achievement measures, postsecondary admission statistics can 
no longer alone serve as unquestioned standards for precollegiate 
educational effects. For one thing, a singular focus on college admission 
misses the goals and organization of the comprehensive high school and the 
diversity of its students' goals (Sykes, 1985). But even for the population 
segment aspiring to postsecondary education, the use of admission 
performance and acceptance rates is not easily interpreted because of 
contextual circumstances or conditions. For example, the pressures on 
postsecondary institutions to fill available student slots, coupled with 
the traditional commitment in the United States to open access to 
postsecondary schooling, make college intake numbers less convincing as 
indicators of public school performance. What might be credible measures 
suitable for comparison are what happens to students in college, how they 
perform, and how they demonstrate the quality of their academic preparation 
(Pace, 1983). What sense should be made, for instance, of the extensive 
remedial efforts now required by two-year colleges and even prestigious 
research universities for their entering students? Certainly these efforts 
suggest that the quality of schooling cannot be easily glossed over in 
terms of distributions of students moving to higher education. A serious 
effort in this arena opens up the questions of what postsecondary education 
is, whom it serves, and what its effects could be. Clearly, postsecondary 
institutions have conducted evaluation efforts directed at ranking on 
institutional criteria: faculty, libraries, research productivity, and so 
forth. But for postsecondary information to serve as more than a mystic, 
habitual indicator of public school preparation, the quality of student 
learning in college will need to be directly addressed, and soon. 

Another obvious comparison option is the relative quality of schools 
(districts, states) with respect to a national standard. Because of the 
local organization of education, a clear criterion for comparison has not 
existed. But there remains a continued tension between the desire for a 
"national" picture and the local authority for educational services. The 
pull of a national achievement indicator is attractive, but resistance is 
also strong for constitutional and for less lofty reasons. A national test 
could be created (and is periodically suggested), but only at the risk of 
reduced local validity of findings for diverse student and instructional 
settings. The tradeoffs of uniformity vs. some direct measure of national 
performance have been partially addressed by the National Assessment of 
Educational Progress. But since its design was originally not intended to 
provide comparisons linked directly to the educational efforts of 
bureaucratic units (Wirtz and Lapointe, 1977), necessary changes in the 
frequency, types, and distribution of NAEP test administration could 
sharpen contention and reduce compliance. Alternative processes for 
providing more valid national comparisons are under development (Burstein 
and Baker, 1985; Bock, 1985) related to the use of existing state level 
indicator data to feed into analyses conducted by the National Center for 
Educational Statistics (Elliott and Hall, 1985). But unless there are 
significant policy changes, the quest for a national comparative base for 
educational impact will continue to be satisfied by partial, qualified, and 
in some minds, appropriately blurry information. 

A last arena for comparison that has grown in attention is American 
educational quality contrasted to that produced by other countries. 
Through international studies such as those conducted by IEA (Purves, 1980; 
Travers, 1984; Burstein et al., in press; Baker, 1985), the standing of US 
students is judged on internationally arbitrated performance measures. The 
utility of inferences from international comparisons can easily be 
challenged: educational systems differ dramatically in terms of tradition, 
size, centralized management, tracking, selection, and access of students. 
On the other hand, such comparisons do provide an imprecise but compelling 
benchmark: when all things are considered, how well do US students do? 
Yet any international comparison should also answer the question of what 
else US students can do and where else they show deficiencies. At any 
rate, it is dangerous to assume that education in the United States should 
adopt Japanese instructional practices or French or British examination 
systems. Clark (1985) points out that countries which emphasize high school 
exit rather than college entrance examinations have traditions of academic 
excellence and prestige for teachers teaching the highest track. These 
countries can demonstrate tight linkages between secondary and higher 
education excellence when there are concomitant tight tracking and 
selection processes in the lower schools. Many of these conditions run 
smack into American traditions of access and equity and the historical, if 
unfortunate, role of teacher education in the university, among a complex 
of factors. So inferences from such international comparisons may create 
general competitive goals rather than a specific call to adopt practices of 
other countries. These inferences should be made carefully, and must 
attend to systemic differences in the organization of education as well as 
the surface features of examination processes. 

Expanding the band of information. It is a fact that much of 
precollegiate school evaluation activity depends upon measures of student 
achievement. The usefulness of such information depends not only upon the 
validity issues identified earlier, but on the extent to which such 
information adequately represents educational quality. It is our judgment 
that the present dependence upon achievement tests grossly underrepresents 
important dimensions of educational quality. Just as we hope to expand the 
base of valid measurement of achievement, we also wish to expand the range 
of information used in evaluation systems beyond achievement, to include 
other important indicators of quality (Sirotnik et al., 1983). Construct 
validity in an achievement area has been pursued by combining various 
achievement tests. Here we are pursuing the combination of achievement 
measures with other indicators to assess more validly a larger construct, 
that of educational quality. 

To obtain a full picture of educational quality, properly 
contextualized, is probably a fool's errand. To obtain an improved 
picture, with broader focus, is within our grasp. Our measures should 
address variables of context, inputs, processes, and outcomes. Context 
measures include characteristics of children, including language 
proficiency, socioeconomic conditions of settings, transiency rates, and so 
on. These facts often are overlooked in simple evaluation studies and 
often seriously influence appropriate inferences about educational 
quality. Input variables include measures of financial support, quality of 
teachers attracted to the system, quality of physical surroundings, etc. 
Process variables involve the interactive behavior of administrators, 
teachers, students, and parents, and students' instructional activities, 
including such things as time on task, expectations for learning, parent 
involvement, and teacher satisfaction. Outcome measures include 
standardized achievement tests, measures of student production (such as 
writing), student attitudes toward school and learning, dropout, etc. Here 
the concern is selecting variables that are likely to be relevant to the 
intervention assessed and selecting measures that meet criteria of validity 
similar in scope (but not in nature) to those identified for achievement 
measures in the section above. In selecting variables and measures, our 
interest is in identifying an optimal number for sensible interpretation. 
Of special emphasis is the relationship among process and outcome measures, 
especially the extent to which changes in process may serve as proximal 
predictors of student outcomes. This particular concern derives from the 
checkered history of comparative evaluations where measures of 
instructional process were rarely undertaken, or when processes were 
measured, treatment differences were often undetectable (Burstein, 1981; 
Stake, 1978; House, Glass, McLean, and Walker, 1978). Measures of 
organization process (Williams and Bank, 1984) also appear to be important 
predictors of intervention effects. We do not see the mission of the CSTES 
as being principally concerned with the identification of these variables, 
for this task is better accomplished by other R&D centers (related to 
specific levels of schooling, teaching, etc.). Furthermore, expansion of 
the set of educational quality indicators, although a strong interest in 
our present proposal, is also being addressed by other organizations, such 
as the NCES and National Academy of Science, to name but two of the main 
actors. Our interests are to assure that places are held in evaluation 
systems for such variables and to assist in the measurement issues 
attendant to their application. 

How shall indicators be conceived? The economy and validity demanded 
by multi-level application of measures should be a concern as we attempt to 
broaden our information base. Clearly, the prescience that information 
will be applied at different levels will influence the nature and form of 
the questions posed. If economy of effort is a serious matter, then 
agreements must be made on apparently simple matters such as format and 
meaning of variables. These agreements about the range, type, real 
meaning, and formats of information will generate tension in the vertical 
operation of the system (among information providers and users from the 
classroom, school, local district, and state levels). Different but 
equally important issues will need resolution as coordinate information 
needs generalize horizontally (state to state, for instance). And another 
set of contentions will be addressed by members of different institutions, 
e.g., elementary and secondary schools, community colleges, and 
universities, who attempt to find indicators to assess the articulation and 
cumulative effects of multi-institutional systems. 

Integrated educational quality assessment: Creating multi-level 
evaluation systems. Integrating quality information, valid inferences, 
and multi-level and multi-institutional contexts into a set of operating 
systems is a tall order. To recap the discussion thus far, features of 
such an ideal system would consist of valid information including student 
achievement measures (using a variety of methods and formats), and an 
expanded set of indicators of school context, processes, resources, and 
non-achievement outcomes. Functionally, valid inferences must be drawn by 
integrating measures into valid composite indicators and interpreting 
information in the light of the specific and multi-level context(s). Last, 
the system would provide comparisons against multiple criteria. The 
intent of this system would be to generate ways to evaluate educational 
policies and practices, and to contribute to their amendment and 
improvement. Clearly the nature of the educational system precludes a 
lock-step development of even an approximation of such an evaluation 
system. It would certainly be naive to expect, for instance, that the 
imposition of a particular set of state level standards of testing and 
evaluation requirements would have uniform or generally consistent effects 
as the intent of policy was successively reinterpreted at lower levels of 
educational organization. 

The abstraction of a complicated system takes on unexpected forms of 
reality as real implementation is addressed. Our intentions in exploring 
the design of multilevel systems involve a dual focus on the technical 
quality of the intervention and on the local realities that contribute to 
or impede the implementation of innovation (Hathaway, 1985; Cooley and 
Bickel, 1985; Sirotnik and Burstein, 1985; Bank and Williams, 1984a; 
Herman, 1985; Dorr-Bremme, 1983). When grand plans confront habits of daily 
decision-making in classrooms and schools, grand plans often crumble. Thus, 
in our own efforts we intend to provide opportunity for local participants, 
including teachers and school principals as well as district and state 
managers, to have serious influence on the shape of these evaluation 
systems. We hope to balance, in fact, the locus of evaluation systems at 
the local level (bottom-up) with the clear requirements of state and 
national policy. We also intend to conduct intensive studies of 
implementation so that our efforts may be successively adapted to work to 
the satisfaction of the research scholars, the policymakers, and the people 
who conduct the day-to-day business of teaching and learning. We would 
expect such preliminary systems to incorporate the best R&D available, 
from whatever source. We would expect these systems to 
function in a formative or improvement-oriented manner. Should the systems 
have merit, we would then wish to assess their impact as interventions 
affecting educational quality. 

Utility and Impact 

The foregoing discussion of the EQIP model, its contexts, and its 
requirements for quality information and inferences is incomplete. The 
power of our formulation also must be judged in light of its utility and 
impact in the real world of schools. Successful application of procedures 
derived from such a framework will depend on less technical concerns. One 
of these is the utility of the information generated by testing and 
evaluation processes. 

Utility. Utility can be analyzed in at least two ways: perceived 
utility and objective utility. Perceived utility resides in the eye of the 
user. Information can be thought to be useful, described as influential in 
ways of thinking about problems or in actual decisionmaking. Information 
may provide clear guidance related to a particular purpose or shed light in 
an unexpected way on an unresolved issue. In this area, we depend upon 
reports of individuals regarding usefulness, or infer utility from the 
ideas held, the language used to express ideas, or actions related to 
extant information (Glass, 1972; Weiss, 1977). 

Objective utility involves the analysis of consequences of information 
for decisionmaking. In some sense it is a reverse engineering problem, a 
problem of tracking back from decisions and attributing partial causes to 
related information. This process is laborious and uncertain in the light 
of the weak links in chains of decisions and because rationalization of 
decisions is a part of organizations and policymakers everywhere. This 
analysis process also provides a distorted view of the ways information 
likely affects decisionmaking, which is not at all as systematic and neat 
as in an experimental research paradigm with clear treatments, periods of 
implementation, and crystalline findings. Rather, quality information gets 
used irregularly, in combination with informal sources and beliefs, and on 
a lurch-and-languish schedule. Research related to evaluation and knowledge 
utilization (Weiss, 1972, 1977; Pelz, 1985; Alkin et al., 1985) is pertinent 
here. We also assess the utility of information in terms of its 
conformance (construct validity) to findings in related areas, the extent 
to which information confirms trends from other data sources or can be 
thought to illuminate new courses of action. 

Impact. Objective utility, then, links the available information base 
and the inferences drawn from it to a set of decisions. Another, tougher 
question involves the result of the decision. What is its impact? Baldly 
put, did the formulation and application of testing and evaluation have 
impact? We have all learned, living with pollution, asbestos, food 
additives, and so on, that outcomes can be both positive and negative, that 
planned good can turn into evil. So our study of impact is goal-free 
(Scriven, 1974) and deals with both benefit and loss. We do not see the 
study of the impact of testing and evaluation to be an interesting 
sidelight. Nor, we are quick to say, can we imagine such studies to be 
anything close to direct tests of the concepts of quality information and 
valid inferences. But it is responsible to close the loop. We must not 
stop with analyses, with research ideas that contribute only to the 
generation of other research ideas. We must use the noisy and imprecise 
information available from targeted impact studies of policy interventions 
as a basis for reexamining our views, our research plans, and our intended 
accomplishments. Because educational testing and evaluation have their 
power as applications in practice, we must describe and report their 
effects as they occur. All of this effort, however, is undertaken with no 
small measure of modesty. Research-based knowledge has a strong 
contribution to make, but is nowhere near sufficient. Our programs will 
succeed if they strengthen the knowledge base underlying practical 
day-to-day decisions. In the longer term, the spread of technology may 
make this utilization problem more tractable and the predicted effects more 
optimistic. 

How can concerns for utility and impact be considered within an R&D 
program? They require intensive and multifaceted study of the effects of 
testing and evaluation. We plan, therefore, to devote attention to testing 
and evaluation not only as ways to assess the system, but also as 
interventions themselves intended to raise standards and to improve 
educational policies and practices. 

Summary 

Our conceptual framework addresses the issue of educational quality 
assessment within the complex contexts of American education and provides 
the backdrop for the CSTES research and development program. First, CSTES 
staff is committed to exploring the use of testing and evaluation to 
improve educational policies and practices at all levels of the educational 
system. Second, we are interested in testing and evaluation (assessment) 
methods which incorporate implicit and explicit expectations for education 
and which provide a more complete and accurate picture of educational 
quality. Third, we are interested in integrated judgments of educational 
quality, integrations made horizontally across various dimensions of 
educational quality, vertically, both up and down across levels of the 
educational system, and longitudinally, across institutions serving 
different ages of the population. Fourth, we are concerned with the 
usefulness and use of assessment in support of improved educational policy 
and practices. 

Goals for CSTES 

This orientation to educational quality assessment and improvement 
leads directly to the explication of CSTES goals. Inherent in our 
conceptual framework are the two institutional goals to which our work will 
be directed. 

TO CONTRIBUTE TO THEORY AND PRACTICE UNDERLYING THE ASSESSMENT OF 
EDUCATIONAL QUALITY; and 

TO CONTRIBUTE TO THE IMPROVEMENT OF EDUCATIONAL QUALITY ITSELF. 

To accomplish these goals, we will focus particularly on five major 
objectives. The first three are derived directly from our conceptual 
framework; the final two support critical R&D strategies: 

1. TO IMPROVE THE VALIDITY OF STUDENT PERFORMANCE MEASURES BY: 

o improving the content base of measures; 

o improving the usefulness of measures in instructional settings; 

o broadening approaches to assessing student performance to increase 
their fairness and utility; 

o integrating research in human cognitive processing and 
assessment; and 

o exploring the applications of technology for test development, 
administration, and analysis. 

2. TO IMPROVE THE VALIDITY OF INFERENCES ABOUT EDUCATIONAL QUALITY BY: 

o developing methods for articulating information vertically in 
institutional and organizational contexts; 

o expanding the band of indicators beyond traditional measures of 
student performance; 

o integrating a variety of measures to provide a better picture of 
educational quality of precollegiate and postsecondary educational 
services and outcomes; 

o conducting analyses of the conceptual and theoretical underpinnings 
of the evaluation process; and 

o exploring the organizational and technical requirements for 
multilevel evaluation systems. 

3. TO EVALUATE THE IMPACT OF STATE AND LOCAL POLICY REFORM IN AREAS OF 
TESTING AND EVALUATION ON EDUCATIONAL QUALITY BY: 

o tracking international, national, state, and local policy reforms in 
testing and evaluation for precollegiate and postsecondary 
educational systems; 

o analyzing problems, promising claims, and effects and regularly 
reporting these to policy, practitioner, parent, and community 
constituencies; 

o studying the effects of particular testing and evaluation policies 
on educational standards, quality of school life, and public 
perceptions to determine if such reforms have their intended 
results; and 

o analyzing, in particular, the effects of testing and evaluation on 
populations with special needs. 

4. TO DISSEMINATE THE RESULTS OF OUR R&D TO A WIDE RANGE OF AUDIENCES 
AND TO HELP FACILITATE THEIR IMPACT ON THE FIELD BY: 

o collaborating closely with stakeholders in testing and evaluation 
utilization and with the R&D community throughout the entire R&D 
process; and 

o disseminating vigorously the results of our research through a 
variety of media and through a wider network of researchers, 
practitioners, and policymakers. 

5. TO SET THE RESEARCH AGENDA FOR THE FIELD OF EDUCATIONAL TESTING AND 
EVALUATION AND ASSURE IT WILL CONTRIBUTE TO NATIONAL EDUCATIONAL 
PRACTICE. 

CHAPTER THREE: Futures Paper 

A Center for Student Testing, Evaluation and Standards: 
Assessing and Improving Educational Quality 

The proposed NIE Center on Student Testing, Evaluation, and Standards 
will conduct research designed to improve the quality of testing and 
evaluation practices, seeking to increase their contribution to educational 
excellence and equity, their impact on local school improvement, and their 
role in enlightened policy making. Central to our approach is the belief 
that evaluation and testing can contribute significantly to educational 
quality and to planning and decision-making at all levels of the 
educational enterprise: from the individual student through the classroom, 
school, district, state, and federal levels. If they are to have such an 
impact, however, testing and evaluation must be sensitive to the 
complexities and realities of the schooling process, to the local and 
regional character of education, and to the multiplicity of constituencies 
who have a stake in education and its evaluation. 

The CSTES represents a unique collaborative effort to advance theory 
and practice in the mission area. A creative national organizational 
structure is proposed which brings together leading researchers from the 
UCLA Center for the Study of Evaluation, from the University of Illinois, 
from the University of Colorado, from the National Opinion Research Center 
at the University of Chicago, and from Educational Testing Service to work 
on pressing educational problems. The utility and impact of the research 
program will benefit not only from the multidisciplinary perspectives of 
this prestigious group but also from the active collaboration of prominent 
practitioners and policymakers from across the country at all levels of the 
educational system: school, district, state, and national. These 
collaborative arrangements will help to assure a targeted R&D program which 
contributes significantly to both knowledge production and knowledge 
utilization. 

Guiding Premises 

Collaborators in the CSTES proposal have well-established credentials 
in the mission area and extensive experience in working together. The 
research agenda we propose is guided by our shared belief in the 
importance of testing and evaluation in improving schools and in informing 
sound public policy. A number of premises are central to our approach: 

o We believe that testing, evaluation, and standard setting can 
contribute to improving the quality of education. Tests, when 
they are well conceived, constructed, administered, and analyzed, 
can provide valuable insights into how individuals and classes of 
students are learning; they can help guide teaching, 
administration, and policymaking within our educational 
institutions. Evaluations of programs, especially when they are 
seen as improvement oriented, locally useful, and iterative, can 
help to guide the reallocation of resources, the modification and 
improvement of activities, and the retraining of personnel. 
Standards, set with due attention both to what is desirable and 
to what is feasible at the state and local levels, can help to 
focus attention and promote accountability for educational 
improvement. 

o We believe that testing and evaluation are important tools for 
promoting educational equity. Tests, when they are sensitive to 
individual differences and preferences in learning styles, provide 
a powerful means for diagnosing students' unique needs and 
providing effective instruction for all students. Furthermore, 
tests, when they match classroom instruction, can provide fair and 
equitable measures of student progress, measures which focus on 
learning accomplishments rather than background characteristics. 
Achievement measures, as well as measures of educational processes 
and community context, can help to identify areas where the needs 
of particular groups are being met and where more attention is 
needed, facilitating more effective programs for all. 

o We believe that testing and evaluation should serve the needs of a 
multiplicity of users. Teachers may need test and evaluation 
information to make instructional decisions; and local school and 
district administrators, as well as policymakers at the state and 
federal levels, need such information to guide their planning and 
decisionmaking. If they are to be useful in supporting and 
improving schools, evaluation and testing activities should be 
decentralized to the local level, while at the same time 
maintaining their utility for addressing legitimate public policy 
concerns, at state levels in particular. 

o We believe that testing, evaluation, and standard setting are 
endeavors which are partly technical, partly political, and partly 
social. Technical expertise is essential in test development and 
analysis, to ensure the valid and reliable use of test results; 
social understanding is essential to ensure fairness and utility. 
Similarly, evaluation questions arise out of people's information 
requirements, while the design and interpretation of evaluations 
depend on technical competence. The definition of standards 
depends on values and consensus; the measurement of their 
attainment involves technical considerations. 

While we are optimistic about the potential of educational testing and 
evaluation, we also are aware of their current shortcomings, cognizant of 
their potential misuses, and sensitive to their possible unintended 
effects. We believe that a national center must play a vigilant role with 
regard to these concerns and function as a consumer advocate to the field, 
analyzing current practices and informing public policy. 

Problems in Practice 

Research in educational testing and evaluation has made important 
strides in the last decade and its methodologies hold great promise for 
improving the state of education. Nonetheless, significant problems remain 
in educational practice: problems related to the quality and diversity of 
existing measures, to the validity of the inferences that can be derived 
from these measures, and to their utility to and impact on 
the educational system. The following examples illustrate the variety of 
existing problems within each of these interconnected areas. 

Problems related to quality of Information . 

1. Most of the testing and evaluation procedures currently used to 
assess students, programs and schools cover only a narrow range of the 
knowledge and skills that are the targets of schooling and do so without 
adequate attention to the nature of these knowledges and skills. For 
example: 

o The National Council of Teachers of English have long decried 
reliance on multiple choice tests as measures of writing skills. 
Associations of teachers of mathematics, of social studies, and of 
science have similarly criticized the content of existing tests 
and the levels of achievement which are assessed. 

o In the push to implement new testing programs, some states and
school districts have paid more attention to new psychometric
techniques than to the knowledge domain being assessed and its
cognitive underpinnings.

2. Given what is known about testing and evaluation design, tests
tend to be of poor quality. For example: 

o The testing materials most commonly used by teachers, e.g., 
end-of-chapter tests, are often extraordinarily poor. They can 
mislead the teacher into believing that students have learned
when, in fact, they have not; or that remedial exercises are
needed when, in fact, more advanced materials would help to
enhance learning. 

o The bells and whistles of the computer revolution and its slick
print-outs often give an undeserved aura of scientific rigor to
score reports. What the reports fail to convey is the
arbitrariness of many classifications (e.g., "mastered" vs.
"failed to master") and the poor reliability of the information,
which may be based on only two or three items per skill. 

3. Bias in the assessment of achievement for special groups is a
continuing problem. For example: 

o While concern about bias has alleviated many problems of
stereotyping, teachers report that many formal tests are unfair
for their students.

o Sophisticated psychometric techniques have been developed to
identify biased items but the source of the identified bias
often remains unknown.

4. The quality of measures at the post-secondary level is
particularly problematic. For example: 

o College admission measures serve as the primary indicator of the
entire precollegiate system, ignoring other important outcomes and
alternate postsecondary experiences. These measures, in addition,
are not well articulated with either precollegiate curriculum or
with post-secondary course offerings.

o Testing has made its entrance in the collegiate environment in
narrow enclaves: dealing with "underprepared," often minority
students, in courses designed to ready students for college level
work; less frequently in qualifying exit examinations related to
writing or mathematics performance. But answers to the larger
question of the effects of higher education on intellectual growth
and on preparation are inferred from patterns of course enrollment and
grade point averages.

Problems related to quality of inferences.

5. Most testing programs and evaluation systems devote scant 
attention to the mediating factors, e.g., the quality of educational 
processes, background variables, and other contextual characteristics, 
which are basic to understanding student performance. For example: 

o Every year, a metropolitan newspaper in California ranks schools
in terms of their students' scores on achievement tests. Missing
from these public reports is any consideration of the factors that
may explain differences or changes in rank, such as a sudden
influx of children from different language backgrounds.

o High student mobility rates may obscure a given school's quality 
of effort. Thus, in large urban school districts, only 40 percent
of the children who enter a particular school in the fall will
still be attending that school in June, and absence rates may run
as high as 50 percent every day. But public evaluation documents 
almost never mention these factors. 

6. The Federal concern for developing a National Report Card 
underscores the need for state and national level indicators of overall
educational quality, but many problems remain. For example: 

o The component indicators of quality receive considerable attention
but tend to focus on grossly and uncertainly defined but more easily
accessed datasets of macro variables, e.g., dropout, "student
achievement" data (like the SAT examination), and teacher academic
history. Neglected is the broad picture of input, process, and
outcome indicators which might provide the critical context for
understanding and judging comparative quality.

o Potential sources of valid student performance data exist in
ongoing state assessment programs, for instance, but
investigations of means for aggregating such information are only
just underway for state by state comparisons. The importance of
test content receives less attention.

o The idea of a national test to estimate overall national system
performance recurs periodically, with the National Assessment of
Educational Progress being the current version of the idea. Scant
attention has been paid to costs and benefits of linking existing
assessment systems to create national indicators.

7. Concern for student achievement and the quality of American 
education escalates each time an international comparison of student
performance is conducted. Yet there has been little consideration of the
use of international studies, or the measures generated by them, as
benchmarks to protect America's ability to compete in technological,
academic, and economic futures. For example: 

o The Second International Mathematics Study provided a comparison of
the United States and 20 other countries. Results show that the
United States performed relatively poorly in comparison with
Japan. Less serious consideration was given to the meaning of
these data with respect to the role that content coverage, the
quality of instruction, or the differences in background,
abilities, and attitudes might play in the highlighted performance
differences, although data on these student and instructional
characteristics are available.

8. Because different types of decisions (e.g., policy, institutional,
instructional, counseling) require different types of information, a
patchwork system for collecting information has been created. Not only are
the testing and evaluation procedures used unnecessarily intrusive, but the
information produced is overly redundant. The redundancy may be
particularly acute for special populations. For example:

o Children participating in a Chapter I program at a midwestern
school must take the CTBS in the fall and again in the spring, in
addition to mandated state assessment tests, a districtwide
norm-referenced test, and an array of curriculum-embedded tests.
The information from these tests is never integrated, is largely
redundant, and only tangentially influences teaching practices.

Problems related to utility and impact.

9. Student testing programs, on which much of evaluation depends, are
externally imposed from the top down, but the use of data for local school
improvement is a bottom-up proposition, local and specific in nature. The
result is data of limited utility for teachers and school administrators.
For example: 

o Extensive interviews with district administrators, principals, and
teachers in one midwestern school district found that while each
of these groups believed the tests had value for the system as a
whole, each group also said the tests were not germane to its own
needs. Thus, district administrators said that tests were helpful
to teachers; teachers thought them useful to principals; and
principals felt they were essential to district administrators.
In short, no group acknowledged that it found such information
valuable. 

o According to a national study of teachers' use of testing, 
teachers reported very little practical decisionmaking based on 
formal testing because of the mismatch of test content and
instruction, poor reporting formats, and inappropriate timing of
results. 

10. Schools are supposed to be vehicles of social mobility and equity, 
giving all students an opportunity to achieve and to reap the benefits of
productive participation in society. Although rigorous testing systems are
supposed to contribute to this process, evidence suggests that testing may
actually impede social mobility. For example:

o According to a prestigious national study of schooling, testing 
has contributed to the tracking of students into rigid vocational
and academic lines, thereby reducing the prospects for individual
growth and satisfaction. 

o The treatment of special populations (e.g., children from 
different language backgrounds or with different developmental 
histories) often amounts to placement in dead-end tracks with
little opportunity for change or advancement. 

11. Tests and evaluation are regarded not only as processes for 
assessing educational quality, but as significant interventions in
themselves that will promote excellence and high standards. There is
widespread belief that the imposition of testing systems will focus and
motivate learning, but other effects contrary to excellence may also 
accrue. For example: 

o One eastern school district, echoing teachers' concerns in a
national study, reported substantial narrowing of the curriculum, 
away from science, art, history and higher level skills and toward 
the basic skill areas assessed on mandated tests. 

o Districts around the country are investing resources to train
children in test-taking that could be allocated to encouraging
subject matter learning; teaching to the test is a common
occurrence. 

o Acceptable pass rates are a political necessity, resulting in
cut-scores that reflect neither excellence nor even minimum 
competency.

These three problem clusters (quality of information, quality of
inferences and interpretation, and utility and impact of testing and
evaluation reforms) are central to our problem-focused R&D program.
Although better instruments, better interpretations and better
understandings of the consequences of testing and evaluation are demanded,
the need runs much deeper. The problems are social and epistemological as
well as technical.

Problem-focused Research Programs 

The conceptual framework defining the CSTES research agenda reflects 
these perspectives, emphasizing the role of information in improving
educational quality and the need for better information about educational
quality to facilitate that improvement process. The three research
programs derived from the framework reflect areas where significant
problems exist in practice and where both steady and identifiable progress
can be made. 

1. The Testing for the Improvement of Learning Program (Testing) 
focuses research attention on the design of measures of student learning 
processes and achievement so that test information can be used to improve
instruction and performance. The program emphasis is on improving the
quality and validity of measures of student performance and their utility
in meeting students' instructional needs. Conceptual syntheses,
theoretically-based empirical studies and exploratory research and
development of content-based measures at the precollegiate and
postsecondary levels are planned. These projects address the primary
program objective of improving the validity of student performance measures
by: improving the content base of measures; improving the usefulness of
measures for multiple instructional purposes; broadening approaches to
assessing student performance to increase their fairness and utility;
integrating research in human cognitive processing and in assessment; and
exploring the applications of technology for test development, 
administration, and analysis. 

2. The Systems for Evaluating and Improving Educational Quality 
Program (Evaluation) is designed to strengthen methodologies for using
evaluation to improve educational quality. It seeks to decentralize
evaluation systems to the local school level where they can help teachers
and school administrators to understand their problems and better meet the
instructional needs of students while at the same time accommodating the
information needs of local and state policymakers. Conceptual syntheses,
field-based empirical studies, and research and development projects are
proposed to accomplish the primary program objective: to improve the
validity of inferences about educational quality by developing
methodologies for articulating information needs at the various levels of
the educational system; by expanding the band of indicators used to
understand and judge quality; by integrating a variety of measures to
provide a better picture of educational quality at the precollegiate and
postsecondary levels; by exploring the organizational and technical
requirements for multilevel evaluation systems; and by conducting analyses of the
conceptual and theoretical underpinnings of the evaluation process.

3. The Impact of Testing and Evaluation on Educational Standards,
Policy and Practice Program (Impact) seeks to examine the actual effects of
testing and evaluation on educational quality and their role in promoting
excellence and equity. The program will also monitor and analyze the
implementation and quality of new test and evaluation developments on the
national level, particularly as they serve as measures of educational
reform. The results of the program will provide significant information
for educational policymakers responsible for the design of educational
programs. The program also is designed to assess and facilitate the impact
of CSTES research and development on educational policy and practice and to
serve a needs assessment function for future R&D.

The three programs are designed to interact and to support the
underlying reason for a center of research and development rather than
support for individual products. Explorations in the Testing Program will
influence the types of measures used in the evaluation systems studied in
the Evaluation Program. Feedback about effects or identification of
promising practices obtained in the Impact Program can affect
goals and research plans in both the Testing and Evaluation Programs. Productive
findings in the Testing and Evaluation Programs should, in the long run,
show their effects in the work of the Impact Program.

Planned institutional function activities incorporate a number of
strategies to assure that the results of the research programs are widely
disseminated to intended audiences (teachers; school, district, and state
administrators; state and local policymakers; test publishers; and other
researchers) and that they influence future educational research, policy
and practice.